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The detection of similarities between long DNA and protein sequences is studied using concepts 
of statistical physics. It is shown that mutual similarities can be detected by sequence alignment 
methods only if their amount exceeds a threshold value. The onset of detection is a critical phase 
transition viewed as a localization-delocalization transition. The fidelity of the alignment is the 
order parameter of that transition; it leads to criteria to select optimal alignment parameters. 
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Evaluating the similarities between long strings of al- 
phabet is a challenging task, which arises in many fields 
ranging from data processing to biology [1]. Standard 
applications involve the comparisons of copies of a mes- 
sage sequence blurred by an imperfect transmission or 
reproduction process. A particularly important exam- 
ple is evolution in biological systems, a process that mu- 
tates gene sequences in various ways including local sub- 
stitutions, insertions, and deletions. Molecular biologists 
routinely compare newly sequenced genes to known ones 
in databases, a first step towards unveihng the struc- 
ture and function of the new findings. This so-called 
sequence alignment is the most widely used mathemati- 
cal/computational tool in molecular biology [2]. 

A (global) alignment of two sequences {Pi} and {Qj} 
is defined as an ordered set of pairings {Pi,Qj) and of 
unpaired elements (Pi,—) and {—,Qj) called gaps, each 
letter Pi and Qj belonging to exactly one pairing or gap 
(see Fig. 1). The optimal alignment of the two sequences 
is determined by minimization of a cost or "energy" func- 
tion E favoring pairs of matching elements {Pi = Qj) 
over mismatches {Pi ^ Qj) and gaps. A simple and com- 
monly used energy function is the sum over all matches, 
mismatches, and gaps of the alignment contributing the 
respective energies —1, /x > 0, and S > each [3]. Many 
more general energy functions have been discussed [2]. 

{P} A A — A B A B B match(-i) 

/ S S S S mismatch (n) 

{Q} A B BAB A — B — gap(8) 

FIG. 1. One possible alignment of two binary sequences, 
{Pi} = AAABABB and {Qj} = ABBABAB, with six pair- 
ings (five matches, one mismatch) and two gaps. 

Clearly, the optimal ahgnment depends strongly on 
these energy parameters, and so does its fidelity, i.e., the 
extent to which it captures mutual correlations between 
the sequences compared. In particular, if the evolution 
process involves random insertions and deletions, one has 



to allow for gaps in the optimal alignment so that mutu- 
ally correlated regions of the sequences can actually align. 
For gene sequences, a more stringent criterion is the bi- 
ological relevance of an alignment, that is, the extent to 
which the matched regions actually indicate functional 
similarities between different proteins. Finding align- 
ment parameters that lead to high relevance is a diffi- 
cult problem, which has not been solved sj^stcmatically 
so far. For biological applications, "optimal" parameters 
for various types of genes are often deduced empirically 
from sequence pairs whose functional alignments are al- 
ready known [4]. 

In this Letter, we apply ideas and methods of statis- 
tical physics to sequence alignment, introducing a dif- 
ferent conceptual approach towards the parameter selec- 
tion problem. Using simple stochastic models of evolu- 
tion, we mutate an ancestor sequence to obtain daughter 
sequences {Pi} and {Qj} with well-defined mutual cor- 
relations, namely, the ensemble of all pairs (Pj = Qj) 
of unmutated daughters of the same ancestor element. 
These sequences are then aligned using a given energy 
function E, and the fidelity of the optimal alignment is 
quantified as the fraction of correctly recovered such pairs 
{Pi = Qj). For long sequences, we find that the mutual 
correlations cannot be detected if their amount is below a 
threshold value. A critical transition separates this low- 
similarity phase of zero fidelity from the high- similarity 
phase where the fidelity is finite. (This phase transition 
is distinct from the "transition" to the so-called local 
alignment regime discussed in the literature [5].) 

Our analysis of the phase transition is based on the 
known representation [6] of an alignment of two se- 
quences {Pi}, {Qj} on the two-dimensional lattice of 
Fig. 2. The cells of this lattice are labeled by the index 
pairs {i,j), or alternatively by the rotated coordinates 
t = i -\- j and r = i — j. The bonds encode the adja- 
cency of letters: the diagonal bond in cell {i,j) represents 
the pairing {Pi,Qj); horizontal and vertical bonds corre- 
spond to gaps {Pi,—) and {—,Qj), respectively. Thus any 
alignment maps onto a lattice path that is directed along 
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the t coordinate, i.o., givon by a uniquo function r{t). 
For daughter sequences generated from our stochastic 
evolution model, mutual similarities can be represented 
on the lattice by a path R{t) joining all correlated pairs 
{Pi = Qj). We call such a path the "target path". For 
long sequences, it induces a morphological transition on 
the optimal alignment path ro{t): in the low-similarity 
phase, this path is super- diffusive with typical fluctua- 
tions 6r{t) = ro{t)-R{t) scaling as \Sr{t)\ ~ i^/^, while in 
the high-similarity phase, it is localized to the target path 
with finite fiuctuations |^r(i)| ~ see Fig. 3. Hence, 
only in the high-similarity phase is the path ro (i) a faith- 
ful approximation of the target R{t). The fidelity of the 
alignment is then simply given by the overlap of the two 
paths, i.e., by their number of intersections per unit of t. 
For long sequences, the overlap is proportional to the in- 
verse localization length maximization of this "order 
parameter" gives a numerically and analytically accessi- 
ble criterion for the choice of alignment parameters. 

{P,} 
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FIG. 2. The directed path (thick line) corresponding to the 
alignment in Fig. 1. Horizontal and vertical lattice bonds 
represent gaps, diagonal bonds represent pairings with bond 
energies Vr,t = — (J ± A) for matches (fuU-wiggly lines) and 
mismatches (dashed- wiggly lines), respectively. 
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FIG. 3. Typical large-scale configurations of the opti- 
mal alignment path ro{t) for long sequences, (a) In the 
high-similarity pliase, tliis patli (full line) is localized to the 
target patli R{t) (dashed line) with typical transverse dis- 
placement of size (b) In the low-similarity phase, it fluc- 
tuates independently of the target path. 



To establish these results, we restrict ourselves here 
to binary sequences (with elements Pi,Qj G {-1-1,— 1}), 
and to the simple model introduced above with just two 
parameters // and S (generalizations are briefiy discussed 
at the end of this Letter) . With a convenient shift in the 
energy function, we write the total energy E of any path 
r{t) as the sum over all its diagonal bonds with bond 
energies Vr,t = - J - APi={r+t)/2 Qj={r-t)/2] horizontal 
and vertical bonds take the energy 0. The parameters 
J = 26 - {p - l)/2 and A = {p + l)/2 are the effective 
gap and mismatch costs; they characterize the stiffness of 
the alignment paths and the site-to-site variations of the 
pairing potential Vr,t, respectively. The lowest energy 
path ro{t) depends on fi and d only via the ratio g = 
A/ J. We shall restrict our analysis to the biologically 
interesting limit of ^ < 1 or 25 > [8]. 

To model the behavior of typical alignment paths in 
the low- similarity phase, we take the sequences {Pi} and 
{Qj} from an ensemble of unbiased random sequences 
(i.e.. Pi = Qj = 0, PiPi' = Si^ i', and QjQj' = Sjji) with 
no mutual correlations, PiQj = 0. (Averages over this 
ensemble are denoted by overbars.) The pairing potential 
Vr^t then becomes a random potential with average = 
— J and variance 



Vr,tVr',t'-J =A P(r+t)/2P{r'+t')/2Q(r-t)/2Q(r'-t')/2 
= A^Sr,r'St,t' . (1) 

It induces random fluctuations on the afignment paths. 

The large-scale statistics of these fluctuations can be 
derived from the partition function of the alignment 
paths [7] in a path integral representation [9]. We 
show [8] that the continuum "action" of this path in- 
tegral takes the form [10] 

5 = / dr [| p2 + ^(p{r),T) + 0{p\np\ •■•)]• (2) 

It describes a "coarse-grained" alignment path p(r) with 
a flnite line tension 7 (and p = dp/dr) in a coarse-grained 
rando m potential rjjp , r) characterized by its second mo- 
ment ri{p,T)ri{p' ,t') ~ A^(5(p — p')5{t — r'). All other 
short-ranged moments [11], as well as the terms omit- 
ted in (2), are irrelevant variables. We conclude that the 
large-distance behavior of afignments is governed by the 
well-known universality class of a directed path in a two- 
dimensional Gaussian random potential [9] . Many prop- 
erties of this universality class are known exactly. Typi- 
cal fluctuations of the optimal path ro {t) are (asymptot- 
ically) super-diffusive, (ro(t) - ro(i'))^ ~ A |i - re- 
flecting the tendency of the optimal alignment to use gaps 

to gain an excess number of matches over mismatches. 

2 

This goes along with the variance Eq{N) — Eo{N) ~ 

B N'^^^ of the optimal energy Eq {N) for paths of length 

A'' ^ 1. We have verified these properties numerically; an 

example is shown in Fig. 4. The nonuniversal amplitudes 
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A and B have also been obtained. Scaling arguments, 
which are supported by our numerics, yield A ~ g^^"^ and 
B ~ 5"/^ in the biologically relevant limit -C 1 [8] . 
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FIG. 4. Energy variations (ABo) = -Bg(t) - Eo{t) 
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of the optimal alignments. The results are obtained from a 
sample of 100 pairs of uncorrelated random sequences for each 
set of the alignment parameters listed. 



To study the high-similarity phase, we construct se- 
quences {Pi} and {Qj} with mutual correlations as they 
would arise if these sequences developed from a common 
ancestor sequence (again taken to be an unbiased random 
sequence) through independent mutations over a time 
9. Consider first evolution through point substitutions 
only: on each sequence, elements are randomly chosen 
with rate T and replaced by an unbiased random num- 
ber ±1. Then {Pi} and {Qj} remain unbiased random 
sequences and have mutual correlations PiQj = p'^Sij, 
where p = exp(— F^) is the fraction of elements on each 
sequence that are still unaffected by the mutations after 
a time 9. It follows that the pairing potential still has the 
variance (1) (up to an irrelevant [8] term ~ dt^t'Sr,-r'), 
but the average is now Vr^t = —J — U^rfi'- mutual correla- 
tions generate a target path R{t) = (along the diagonal 
of the alignment lattice) that attracts the alignment paths 
with strength U = gp^ J. This interaction refiects the ex- 
cess number of matches (P, = Qi) of unmutated daugh- 
ter elements of the ancestor sequence. It turns out to 
change the optimal alignment drastically [13]: The path 
ro{t) becomes localized, i.e., its mean square displacement 
from the target path rl{t) grows no longer superdiffu- 
sively, but reaches a finite value = limf^oo Tq (t) for 
long sequences. In recent years, this localization has been 
studied carefully in the context of fiux pinning in type-II 
superconductors [13]. It is governed by the competition 
of the potential energy ~ U/^± per unit of t gained from 
the overlap with the target path, and the random energy 
cost V/£,± (with a constant F ~ 5^ J for g^ < 1 [8]) due 
to the confinement of the alignment path [12]. (This is 
very similar to the competition of potential and kinetic 



energy determining the localization of a quantum parti- 
cle in a potential well.) In the limit p = 1 of identical 
sequences, the optimal alignment is obviously ro{t) = 0; 
the path is tightly bound to the target. As p decreases, 
the fiuctuations around the target increase, and the over- 
lap decreases. It is found [13] that the path ro{t) remains 
localized to the target even for an arbitrarily weak at- 
traction U > 0, although the localization length becomes 
very large for C7 <^V, 



exp(bV/U) 



(3) 



(with a universal constant b « 2.8). Thus the fidelity of 
the optimal alignment, ^J^, tends to zero with vanishing 
U; this signals a continuous transition to the superdiffu- 
sive behavior at U = 0. 

We now turn to a more realistic model of evolution 
that includes local insertions and deletions: we generate 
the sequences {P,} and {Qj} from a common ancestor 
through point substitutions as before, then we randomly 
choose a fraction p <C 1 of the sites on each sequence, 
and insert or delete a random element ±1 at the chosen 
sites. This modifies the mutual correlations. 



PiQj = P k-j,R{i+3) ' 

and hence the pairing potential Vr,t 



(4) 



-J - US, 



'r,R(t)- 



The target path R(t) is no longer along the diagonal 
of the ahgnment lattice, but along the trajectory of a 
Gaussian rando m walk with mean square displacement 
{R{t) - R{t')y = p\t-t'\ (see Fig. 3). Since a fraction 
2p of the target path involve gaps, the attractive strength 
of the target is shifted to 



U = [{gp''{l-p)-2p)]J , 



(5) 



and now changes sign at the threshold p^ = pi = 
Ipjg -I- Oip"^). Above the threshold (p^ > p^), the in- 
teraction remains attractive {U > 0). In this phase, the 
optimal alignment path turns out to be localized to the 
target in much the same way as before [13], with a finite 
mean square displacement = limt^oo (''o(^) — -R(^))^ 
given by (3) and (5) and a finite fidelity ~ ■ As 
approaches pi from above, the fidelity again tends to zero 
continuously. For p"^ < pI, the interaction becomes re- 
pulsive, and the path ro(i) is again super-diffusive. This 
is the low-similarity phase, where the optimal alignment 
does not refiect the mutual correlations: its fidelity is 
zero. The singular behavior of the fidelity given by (3) 
can be observed for sequences whose length is larger than 
the correlation length ^|| ~ S,^"^', the fidelity for shorter 
sequences is described in [8]. 

The concept of an order parameter is useful for the 
selection of optimal ahgnment parameters /x and S. As 
follows from Eqs. (3), (5), and V ~ g^J, the correlations 
(4) with given parameters p and p can be detected only 
for g > gc{p,p) = 2p/p'^ + 0{p^). They are recovered best 
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if g takos the value g*{p,p) = 4p/p'^ + 0{p^) obtained by 
maximizing for fixed p and p <^ p^. This is a linear 
condition on fi and S. 

There are a number of related alignment issues rele- 
vant to applications in biology, where our results apply 
in a similar way; a detailed account will be published 
elsewhere [8] . (i) For alphabets with a larger character k 
{k = 4 for DNA and A; = 20 for proteins), the effec- 
tive strength of attraction of the target path increases, 
substantially decreasing the locahzation and correlation 
lengths, (ii) A higher biological relevance of the align- 
ment is often achieved if gap initiations are penalized by 
a higher energy than gap extensions. Such refinements 
lead to alignment paths that have, in addition to their 
line tension 7, a finite "bending rigidity" . (iii) The ahgn- 
ment of n sequences is described by a directed alignment 
path (ri, . . . , r„_i)(t) in n — 1 transversal dimensions. 
For n > 2, the detection threshold is increased to finite 
U > 0, making similarity-detection more difficult. The 
divergence of the localization length close to the tran- 
sition is given there by a power law instead of Eq. (3). 
(iv) The same is true even for the alignment of two se- 
quences if there are long-ranged intra-sequence correla- 
tions that fall off sufficiently slowly (as may be the case 
for the non-coding regions of the genome), (v) Unlike 
the global algorithms discussed so far, local afignment al- 
gorithms match only a contiguous piece of sequence {Pi} 
with a different piece of sequence {Qj}; they are appro- 
priate to finding mutual similarities that exist only within 
these two pieces. It has been noted [5] that the regimes of 
local and global alignment are separated by a transition 
line in the space of parameters {fi,S). That transition 
is quite different from the transition described in this 
Letter; it can be understood [8] as a boundary-induced 
critical phenomenon analogous to wetting transitions. 

In summary, we have described a unique approach to 
similarity detection, identifying sequence alignment al- 
gorithms with physics problems defined on a lattice with 
quenched disorder. We show that the successful detection 
of correlations between sequences depends on the kind of 
mutations they undergo, as well as on the specific choice 
of the alignment parameters. This is demonstrated for 
simple stochastic mutation processes modeling biological 
evolution. In such systems, correlations can only be de- 
tected if their amount exceeds a threshold value; the on- 
set of detection is shown to be a critical phase transition 
with universal characteristics. Most importantly, it is the 
fluctuation statistics at this transition that determines 
the fidelity of the optimal alignment. Using that "order 
parameter" , we derive criteria for the optimal choice of 
alignment parameters given a limited knowledge of the 
mutation process (in contrast to the current approach 
of finding these parameters empirically) . Conversely, the 
empirical knowledge of optimal alignment parameters for 
a given class of proteins can be used to infer the nature 
of mutations suffered by those proteins. 
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